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A new Monte Carlo method is proposed for fermion systems interacting with classical degrees 
of freedom. To obtain a weight for each Monte Carlo sample with a fixed configuration of classical 
variables, the moment expansion of the density of states by Chebyshev polynomials is applied 
instead of the direct diagonalization of the fermion Hamiltonian. This reduces a cpu time to scale 
as 0{N^i^ log A'^dim) compared to 0{N^i^^) for the diagonalization in the conventional technique; 
A^dim is the dimension of the Hamiltonian. Another advantage of this method is that parallel 
computation with high efficiency is possible. These significantly save total cpu times of Monte 
Carlo calculations because the calculation of a Monte Carlo weight is the bottleneck part. The 
method is applied to the double-exchange model as an example. The benchmark results show 
that it is possible to make a systematic investigation using a system-size scaling even in three 
dimensions within a realistic cpu timescale. 
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§1. Introduction 

Lattice fermion models interacting with classical de- 
grees of freedom have a wide range of application. In 
many realistic cases, localized spins or lattice distortions 
for instances are handled as classical degrees of freedom 
under the adiabatic approximation. In this work, we are 
interested in the cases where these interactions are pri- 
marily important and many-body interactions between 
fermions can be neglected. Although models are sim- 
plified in these cases, they are still very useful to dis- 
cuss physical properties in some realistic systems. For 
instance, many experimental results in perovskite Mn 
oxides have been successfully explained by the double- 
exchange (DE) model with classical localized t2g spinsid-* 

In such systems, the Hamiltonian is given by T-l{{xi}) 
and the partition function is written as 



(^-p[ni{x,})-txN]y (1) 



Z = TrcTrp exp 



where Trc and Trp are traces over classical degrees of 
freedom denoted by classical variables {xi} and fermion 
degrees of freedom, respectively. Here, (3 is the inverse 
temperature, fi is the chemical potential and N is the 
particle- number operator. 

A prescription to calculate the partition function in 
eq. is Monte Carlo (MC) sampling on configurations 
of classical degrees of freedom. In this method, fermion 
degrees of freedom are traced out beforehand as follows. 
The Hamiltonian 7i ({2;^}) defined for a fixed configu- 
ration of classical variables {xi} can be represented by 
A'dim X Afdim matrix where A'dim is proportional to the 
system size because there is no many-body interaction 
between fermions. Then the trace over fermion degrees 
of freedom is easily calculated by the diagonalization of 



the Hamiltonian matrix as 



Trpexpl-/? n{{x^})-nN 



n [l + exp(-/3(i?,({xJ)-A*))], 



(2) 



where E^, are eigenvalues of the Hamiltonian T-l{{xi}). 
Hence the problem is formally considered as a classical 
one in which the partition function can be written as 



Z = Trc cxp [-SoB {{xi})] 



(3) 



with the effective action 



5'cff {{xt}) 



E 



log [l + e^pi-f3{E,{{x,})-fi))]^ 



_ (4) 
Trace over classical variables in eq. is effectively cal- 
culated by a MC sampling. The MC update is per- 
formed using the Boltzmann weight of the configura- 
tion {xi} which is given by exp[—Scs{{xi})]. Config- 
urations of classical variables {xi} are updated by using 
the Metropolis algorithm. Because eq. (2) is positive 
definite, we do not have a negative sign problem in this 
fermionic MC calculation. 

Actual numerical calculation along the above conven- 
tional procedures is time-consuming and serious to han- 
dle large-sized systems. The bottleneck is the numerical 
diagonalization of the Hamiltonian 7i ({xi}). It costs a 
cpu time scaling as 0{N^^^^). If there are classical de- 
grees of freedom proportional to A^dim, as actually seen 
in many systems, one MC sweep to update all the values 
of {xi} totally scales as 0{N^^^). Let us consider the 
DE models for instance. We are especially interested in 
three-dimensional cases as realistic models for Mn oxides. 
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According to our estimation described later, a cpu time 
for 1, 000 MC sweeps in 6 x 6 x 6-sites system costs about 
10 days and that in 8 x 8 x 8 costs about 10 months on 
a standard workstation. This makes it extremely diffi- 
cult to make a systematic size-scaling analysis, especially 
in three dimensions. So far, MC studies have been per- 
formed mainly in one and two dimensions without the 
system-size scaling in many cases.BEl' 

In this work, we propose an alternative technique of 
MC calculations which is of great advantage to reduce 
the cpu time. In our method, the numerical diagonaliza- 
tion of the Hamiltonian is replaced by a moment expan- 
sion of the density of states by using orthogonal polyno- 
mials. This moment expansion costs a cpu time scaling 
as only 0{N^^^ log A'^dim) compared to 0(-/Vdim) the di- 
agonalization. Moreover, the procedure can be done by 
parallel computations whereas the parallelization is diffi- 
cult for the diagonalization in the conventional method. 
Efficiency of the parallelization is very high especially for 
large-sized systems. These significantly save a total cpu 
time for MC calculations by accelerating the bottleneck 
part to obtain the MC weights. 

The paper is organized as follows: In §||, the new al- 
gorithm is introduced. It is applied to DE models in §^ 
to demonstrate its advantages. The benchmark results 
for a cpu time are also shown in §|[ Sec. |^ is devoted to 
summary. 

§2. Algorithm 

In the conventional MC technique for the class of mod- 
els considered here, as mentioned in §^ the Hamiltonian 
is numerically diagonalized for each configuration oi {xi} 
to give all the eigenvalues. The effective action Scs is ex- 
actly calculated through eq. (^) by using these eigenval- 
ues. However, the exact eigenvalues contain more than 
enough information for the purpose of performing prac- 
tical MC calculations: It is sufhcient to know the density 
of states within required accuracy. 

In our algorithm, instead of the eigenvalues of the 
Hamiltonian, the density of states is estimated by us- 
ing a moment expansion with orthogonal polynomials. 
This technique has been originally formulated to handle 
huge-sized matrix numerically.a'l3) The Chebyshev poly- 
nomials are convenient in the moment expansion, which 
are defined recursively by 



m+1 



(x) = 2xT.m{x) - Trn-l{x), 



(5) 



with To = 1 and Ti — x for — 1 < x < 1. 

First, we define a renormalized Hamiltonian X{{xi}) 
whose eigenvalues are —l<e,y<lhy'H — aX + b with 

a ~ (Emax — £',„i„)/2, b = (ii^max + ^^min)/2, (6) 

where £^max (-Emin) is a highest (lowest) eigenvalue of the 
Hamiltonian Ti.. Then, the m-th Chebyshev moment of 
the density of states D{£) = j^— J^i, ^{^ ~ ^i^) is defined 
by 

Aim- J J,n{e)D{e)de. (7) 
Once the moments fj,m are calculated, the density of 



states is inversely obtained by 
1 



Die) 



ttVT 



(8) 



m>l 

The expectation value of an operator A is calculated by 

(9) 



(A) = / A{e)D{e)de = fiQi^a + 2 X! 

m>l 



where the moments of A is given by 



de 



■:A(e)T^{e) 



-1 7r-\/l — £■ 
In particular, the effective action in cq. 

by 



ScS — Moso + 2 



m>l 



(10) 

is obtained 
(11) 



where 



A'dimde 



log 



,-/3(ae+b-Ai) 



T™(e) 



(12) 



with the coefficients a and b in eq. (||). Eqs. (||) and (||) 
are straightforwardly confirmed Jsy using the orthogonal- 
ity of Chebyshev polynomials. qB 

From the definition (^, the moments of the density of 
states are calculated by 



Mm = 



— Tr{r™(x)} = ^ E (Hrm(x)i^ 

dim. '^dim 



(13) 

where \v) are convenient complete basis of the Hamil- 



tonian {(y\\v2) 



j). If we obtain the vectors 



\v\vn) = r„i(X)|zy), the calculation of a moment is a 
vector product; 



1 



A^di, 



E 



v; 0\v; m). 



(14) 



which costs a cpu time scaling as 0{Ndim)- The vectors 
m) are calculated recursively by using the relation (|^) 

as 



{v; m + 1) — 2X\v] m) — \v] m — 1). 



(15) 



For a sparse Hamiltonian, the matrix-vector product in 
eq. ( |l5| ) costs a cpu time scaling as only O(A^dim)- Fur- 
thermore, we use recursive relations of Chebyshev poly- 
nomials; 



2m 



2T' 



1, T, 



2m+l 



2T„iTm+l — Ti, 



(16) 



which enable us to obtain moments up to A/-th order 
from those only up to A//2-th order. A total cpu time 
to compute the moments /im up to the order of m = M 
scales as OiN^-^^M). 

The present method becomes 'exact', that is, equiv- 
alent to the direct diagonalization of the Hamiltonian 
in the conventional technique when we take the summa- 
tions in eqs. (^ and (^) up to infinite order. In actual 
calculations, we approximate the summations in eqs. (|^) 
and (|^) by finite summations up to m = il/. The ap- 
proximation by the truncation at a finite value of M 
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is controllable since we can always estimate the errors 
through comparison with 'exact' results obtained by the 
conventional technique within the range of system sizes 
we are interested in. In particular, the truncation error 
of the effective action, A5cft, which is a crucial quantity 
in MC updates, becomes exponentially small as a func- 
tion of M, as shown in §^ for DE models as an example. 
This comes from the fact that the moments in eq. 
( p^ ) become exponentially small to the value of m. This 
justifies our approximation even for small values of M . 

We discuss here iVdim dependence of the truncation 
number M which keeps AiSoff (the truncation error of 
S'off) small enough to perform MC calculations practi- 
cally. This is crucial for a cpu time. From the definition 
(^), AS'oft consists of the sum of errors for each eigen- 
value El,. Hence, if these exponentially-small errors are 
statistically independent of each other, AS'off is propor- 
tional to V-^dim- In the case of correlated errors, A^cff 
can be proportional to A'dim- For either case, therefore, 
to estimate ScS within required accuracy, the necessary 
truncation number M should be proportional to log A^dim 
because AS'oft ~ exp(— M). This indicates that a total 
cpu time to obtain moments for actual MC calculations 
scales as 0(-/V^im log^dim)- These properties of the trun- 
cation errors will be examined for DE models as an ex- 
ample in the next section. 

The original formulation of this moment-expansion 
technique has been proposed, bv sampling a few random 
basis for the sum in eq. (p^D'EP This reduces a cpu time 
to scale as 0(A'dim-^^-^), where / is the number of the 
random basis. However, this sampling method works 
well only for huge-sized matrix for which {N^i^J)~^/'^ 
is small enough. In our MC calculations, since the ma- 
trix sizes for the systems we are interested in are not so 
large (A'dim — 10^), errors are not small and accumulate 
through MC updates to lead wrong samplings. From 
this, we take a sum over complete basis in eq. (|l^). 

Another important point to reduce a cpu time is that 
the calculations of the moments in eq. ( |l3|) are com- 
pletely independent for each basis \v). This enables us 
to compute the sum over in a parallel fashion. In this 
algorithm, data to be communicated between different 
nodes are small compared to calculations in each node: 
A cpu time to calculate the moments on each processor is 
proportional to N^^^M / Np-^ whereas a time to commu- 
nicate the calculated moments to add up is proportional 
to MiVpE. Here, iVpE is the number of processors used 
in parallel calculations. This indicates that the paral- 
lel computation is performed very efficiently as far as 
(Adim/A'pE)^ is large. High efficiency of this paralleliza- 
tion is demonstrated in the last part of the next section. 
Since it is difficult to make an efhcient parallelization of 
the matrix diagonalization when all the eigenvalues are 
required, the present algorithm has another advantage 
in accelerating the calculation by parallelization. 

A cpu time to compute Scs is much shortened by both 
the reduction from O(iVdim) ^ 0(7V^i logTVdim) and the 
parallel calculation. As mentioned in §|l|, since the calcu- 
lation of ScS is the bottleneck in MC calculations, a total 
cost is much reduced; when there are 0(-/Vdim) classical 
variables to be updated, a cpu time for one MC sweep 



on all these variables scales as 0{N^i^ log A^^dimZ-^PE) hi 
our algorithm whereas that in the conventional method 
scales as 0{N^^^^^). 

§3. Application 

In this section, we show efficiency of the new algorithm 
introduced in §|| by an application to DE models with 
classical localized spins. The Hamiltonian is given byB' 



n 



E (4c 



h.c. 



(17) 



<iJ>,U 



where c\^{cia) creates (annihilates) a cr-spin electron at 
site i; cti is the spin operator whereas Si denotes the lo- 
calized spin at site i. We consider the nearest- neighbor 
hopping t and the ferromagnetic Hund's-rule coupling 
J > 0. Here the localized spin 5'^ is approximated 
as a classical rotator. The configuration of the clas- 
sical rotator is described by two angles in each site; 
Si = {cos 9 i cos 4>i, sm 9 i cos (j)i,sm(j)i). Then, the clas- 
sical variables {xi} in eq. (|l|) are 2N variables, {0^,0^} 
(i — 1, 2, • • •, N), in this model (pT|). N is the number of 
sites. The Hamiltonian for a fixed configuration {9i, (pi} 
can be written as 2N x 2N matrix. Hereafter we take 
the bandwidth = = 1 as an energy unit, where 
z ~ 2D is the coordination number in a Z)-dimensional 
hypercubic lattice. Throughout the paper we take J = 4 
which is an appropriate value to investiga|tje the model in 
comparison with Mn oxides experiments.EP 

This model (17) has been intensively studied to u 
derstand physical properties of perovskite Mn oxides. 
Some experimental results have been explained; such 
as the ferromagnetic metal in carrier-doped region, the 
transition to paramagnetic state by increasing tempera- 
ture and the negative magnetoresistance near the tran- 
sition. As far as we use the conventional MC technique 
described in §|^, it is difficult to study realistic three- 
dimensional cases based on the size-scaling analysis, be- 
cause of a diverging cpu-time as increasing the system 
sizes. So far, numerical studies h&Ke been performed 
mainly in one and two dimensions .BEP In the following, 
we show that the new algorithm formulated in reduces 
a cpu time significantly and makes it possible to inves- 
tigate much larger-sized systems than the conventional 
technique. 

First, we discuss the truncation error in the moment 
expansion. Figure 1 shows the truncation errors of the 
effective action when we truncate the summation in eq. 
( pr] ) at m = M. The errors are estimated as devia- 
tions from the results by the direct diagonalization of 
the Hamiltonian in the conventional method. The er- 
rors become exponentially small to the value of M as 
shown in the figures. The exponential decay depends 
on the temperature. Approximately, we find a relation; 
ASeff ~exp(-M//3). 

Although the necessary value of the truncation number 
M for practical MC is proportional to logA^dim as men- 
tioned in §H, actual values of M may depend on models 
and parameters. It should be determined in MC results 
for each case. We examine here the condition for M 
by changing temperatures in the DE model (p7[). Fig- 
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ure 2 shows M dependence of physical quantities. We 
calculate here the electron density and the spin struc- 
ture factor at the wave number A: = (ferromagnetic 
component) by the 1, 000 MC samplings with S^s whose 
moment expansion is truncated at to = M . In all the 
figures, the values by the conventional technique with 
direct diagonalization of the Hamiltonian are shown at 
1/M = because our results should agree with them in 
the limit of M ^ oo. For small values of M, the trunca- 
tion errors are so large that the estimated values deviate 
from those by the conventional method. However, for 
M ^ 40, the expectation values converge on those by 
the conventional method within statistical errorbars in 
the temperature range of 10 ^ /3 ;^ 50 which we are in- 
terested in. (The ferromagnetic transition temperature 
is estimated around /? = 20 by the D = oo technique. Q') 
The behavior is similar between data for different sizes. 
This is consistent with the weak A'^im-dependence of M 
(M ~ logiVdim) as mentioned in Therefore, in or- 
der to obtain these quantities of the present model eq. 
([l^ ) within this accuracy, the value of M can be taken 
at around 40 throughout the parameter range of our in- 
terests. 
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Finally, we show the benchmark results for a cpu time 
of our algorithm. Figure 3 shows the benchmark result 
on a workstation with Alpha-processor 21164 533MHz. 
A cpu time for MC sweeps on whole lattice sites in our 
technique is proportional to A'^^, whereas the conven- 
tional technique costs a time proportional to iV'*. When 
wc take the value of M as 40 based on the above obser- 
vation in Fig. 2, our algorithm becomes faster than the 
conventional one for <; 80 — 90 even in a single CPU. 
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Fig. 2. The truncation M dependence of MC estimations for 
physical quantities: (a) and (b) show the electron density, and 
(c) and (d) show the spin structure factor at the wave number 
fc = 0. (a) and (c) are the results for N = 16, and (b) and (d) 
are for A'^ = 64 in one dimension. The results are for J = 4 
and n = —3.5 by 1,000 MC sweeps on whole lattice sites. The 
circles, squares and triangles are for /3 = 10, 20 and 50, respec- 
tively. The filled symbols at 1/A/ = are the values obtained by 
the conventional MC technique using the diagonalization of the 
Hamiltonian. 



Fig. 1. The error of the effective action for a configuration of 
classical variables when the moment expansion is truncated at 
m = M; (a) A'^ = 16 and (b) Af = 64 in one dimension. We take 
J = 4 and /i = —4. The circles, squares and triangles correspond 
to the data for /3 = 10, 50 and 100, respectively. 



As mentioned in §y|, moreover, our algorithm has an 
advantage to be performed on a parallel computer. Fig- 
ure 4 shows efficiency of the parallelization. The parallel 
calculations have been performed using SR-2201 at the 
computer center of University of Tokyo. We find that 
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in Fig. 4 (a), a cpu time is reduced almost inversely 
proportional to the number of processors for large-sized 
systems. Figure 4 (b) shows the efficiency of the paral- 
lelization defined by 

(18) 



Here ^(A'^pe) is a cpu time on iVpE processors. Note 
that R takes a value between and f , and that R = 1 
when the parallelization is perfect. The parallelization 
is more effective for larger-sized systems for the reason 
mentioned in §|^; for instance, R remains around 0.9 for 
6x6x6 systems even when we use 64 processors. 




Fig. 3. The benchmark results for both our algorithm and the 
conventional one on a workstation with Alpha-processor 21164 
(533MHz). The results are for 100 MC sweeps on whole lattice 
sites. The filled diamonds are for the conventional technique. 
The dotted line is a fit by N'^. The circles and squares are for 
our algorithm with M = 4 and 40, respectively. The data are 
fitted by A^'^ as the gray lines. 



Table I summarizes estimated cpu times for various 
system sizes of DE models. Our algorithm is powerful 
enough to study three-dimensional systems with several 
different sizes within a realistic timescale. Physical prop- 
erties in three-dimensional DE models including the fer- 
romagnetic transition temperature will be reported else- 
where. 



Table I. A cpu time of 1, 000 MC sweeps for various system sizes 
of DE models in three dimensions. The time for the conventional 
algorithm is estimated on a work station with a single CPU of 
Alpha-processor 21164 (533MHz). The time for the present al- 
gorithm is estimated on SR2201 at the computer center of Uni- 
versity of Tokyo when we use 64 processors in parallel. 
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~ 10 days 


10 hours 


8x8x8 


~ 10 months 


~ 6 days 


10 X 10 X 10 


~ 12 years 


~ 6 weeks 
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Fig. 4. The benchmark results for our algorithm on the parallel 
computer, SR2201 at the computer center of University of Tokyo. 
The results are for 10 MC sweeps on whole lattice sites with 
M = 40. The circles, squares, triangles and crosses are for N = 
16, 32, 64 and 128 in one dimension, respectively. The diamonds 
show the results for Af = 6x6x6in three dimensions, (a) is 
the-number-of-nodes-dependence of a cpu time, (b) shows the 
efficiency of the parallelization. See text for details. The lines 
are guides to eye. 



§4. Summary 

We have proposed the new algorithm of Monte Carlo 
calculation for fermion systems interacting with classi- 
cal degrees of freedom under the adiabatic approxima- 
tion. The moment expansion of the density of states 
by orthogonal polynomials is used, instead of the direct 
diagonalization of the Hamiltonian, to obtain a Monte 
Carlo weight for a fixed configuration of classical vari- 
ables. Errors of the effective action by the truncation of 
the expansion become exponentially small with increas- 
ing the order of the expansion. This allows us to trun- 
cate the expansion at a small order in practical Monte 
Carlo calculations. These reduce a cpu time to scale as 
0(iV^;jjj log A^dim) from 0(iV|i,„) in the diagonahzation, 
where A'dim is the dimension of the Hamiltonian matrix. 
This moment-expansion method is controllable since we 
can always estimate the truncation errors through com- 
parison with results by the diagonalization. As another 
advantage, since the moment expansion is independent 
for each basis of the Hamiltonian, our new algorithm 
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can be performed on parallel computers. Efficiency of 
the parallelization is very high especially for larger-sized 
systems since data to be communicated between nodes 
become smaller compared to calculations in each node. 
The reduction of the order of N^im and the paralleliza- 
tion significantly accelerate the procedure to obtain a 
Monte Carlo weight. This leads to much reduction of a 
total cpu time for Monte Carlo since the calculation of 
the weight is the bottleneck part. 

In order to show the efficiency of our algorithm, we 
have applied it to the double-exchange model with classi- 
cal localized spins. We have examined effects of the trun- 
cation of the moment expansion in actual Monte Carlo 
calculations. The condition for the truncation is clarified 
in the parameter region of interest. The benchmark re- 
sults on both a single and multi cpu systems have been 
shown. Our algorithm is a powerful tool to investigate 
large-sized systems which are difficult to handle by the 
conventional technique within a realistic cpu timescale; 
for the double-exchange model, it may make possible to 
study three-dimensional systems systematically through 
system-size scaling analysis. 

There are many other applications of our algorithm; 
such as dauble-exchange models including Jahn- Teller 
distortionja) surface effects in double-exchange systems 
and some models for organic compounds. Our algorithm 
gives us a possible way to investigate these problems by 
using a numerical analysis for larger-sized systems 
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